Fix: ESP32 KISS modem becomes permanently unresponsive under TX backpressure #2646
agessaman wants to merge 2 commits
…fer stalls. Introduce frame write management and update related methods for improved handling of serial communication. Adjust platform configurations for USB-CDC to enhance reliability on ESP32.
…nd timeout handling for USB-CDC. Enhance documentation for better understanding of frame management and serial communication flow.
void KissModem::writeFrame(uint8_t type, const uint8_t* data, uint16_t len) {
Just wondering if a better approach is to pre-calc the total frame length, then introduce a new method like:
bool canWriteFrame(size_t len);
And for logic to be:
writeFrame( ... ) {
  size_t total_len = ... ;
  if (!canWriteFrame(total_len)) return; // bail, all or nothing
  _serial.write( ... ); ...
}
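For illustration, the all-or-nothing idea above could be sketched roughly as follows. This is not the PR's code: `FakeSerial`, `escapedSize`, `encodedFrameLen`, and `writeEscaped` are hypothetical names, the serial object is a stand-in for an Arduino-style port exposing `availableForWrite()`, and the sketch assumes the frame is `FEND`, an escaped type byte, the escaped payload, then `FEND`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// KISS framing bytes as defined by the KISS protocol.
constexpr uint8_t FEND  = 0xC0;
constexpr uint8_t FESC  = 0xDB;
constexpr uint8_t TFEND = 0xDC;
constexpr uint8_t TFESC = 0xDD;

// Hypothetical stand-in for the serial port. On real hardware this role would
// be played by an Arduino Stream/Print object with availableForWrite().
struct FakeSerial {
  size_t capacity;            // free bytes left in the TX buffer
  std::vector<uint8_t> sent;  // bytes accepted so far
  size_t availableForWrite() const { return capacity; }
  size_t write(uint8_t b) {
    if (capacity == 0) return 0;
    --capacity;
    sent.push_back(b);
    return 1;
  }
};

// FEND and FESC each expand to a two-byte escape sequence on the wire.
inline size_t escapedSize(uint8_t b) {
  return (b == FEND || b == FESC) ? 2 : 1;
}

// Total on-wire length: two FEND delimiters plus escaped type and payload.
// Assumption: the type byte is escaped like payload bytes.
size_t encodedFrameLen(uint8_t type, const uint8_t* data, uint16_t len) {
  size_t total = 2 + escapedSize(type);
  for (uint16_t i = 0; i < len; ++i) total += escapedSize(data[i]);
  return total;
}

bool canWriteFrame(const FakeSerial& s, size_t len) {
  return s.availableForWrite() >= len;
}

void writeEscaped(FakeSerial& s, uint8_t b) {
  if (b == FEND)      { s.write(FESC); s.write(TFEND); }
  else if (b == FESC) { s.write(FESC); s.write(TFESC); }
  else                { s.write(b); }
}

// Returns false and writes nothing if the whole frame does not fit.
bool writeFrame(FakeSerial& s, uint8_t type, const uint8_t* data, uint16_t len) {
  const size_t total = encodedFrameLen(type, data, len);
  if (!canWriteFrame(s, total)) return false;  // bail, all or nothing
  s.write(FEND);
  writeEscaped(s, type);
  for (uint16_t i = 0; i < len; ++i) writeEscaped(s, data[i]);
  s.write(FEND);
  return true;
}
```

One consequence of this design is that a frame whose encoded length exceeds the entire TX buffer can never pass the check, so with very small buffers it would always be dropped.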
I was thinking about the same, but it looks like those buffers are quite small (like 64 bytes in some cases). Need some sleep first, but it's an interesting one to look into.
@ViezeVingertjes any additional thoughts on this? Happy to make revisions following Scott's suggestion, but I wanted your feedback because the KISS firmware is your baby.
I can confirm that this change, plus a couple changes on the pyMC side have completely eliminated the lockup and unresponsive frame issues that we were facing (which is super exciting).
Problem
On ESP32-S3 boards (Heltec V4, Station G2), the KISS modem would stop responding after a while and stay dead: SetHardwarePING (0x17) timed out and the condition survived a host service restart, forcing the pyMC_repeater client into no-radio mode. nRF52 (RAK4631) on identical firmware was unaffected.
Root cause
The ESP32 modem ran over Arduino TinyUSB USBCDC (ARDUINO_USB_MODE=0), which wedges under TX backpressure: USBCDC::write() busy-spins indefinitely while "connected" (its tx_timeout_ms only guards the lock acquire, not the send loop), and RX events post to a 5-deep queue with portMAX_DELAY. Since loop() is single-threaded and the ESP32-S3 has no DTR/RTS→EN reset, nothing recovers it. nRF52 is immune (hardware UART, non-blocking FIFO writes).
Confirmed by a deterministic repro (flood the modem while never reading replies → permanent wedge) and a heartbeat LED that kept blinking while the data path was frozen, isolating the hang to the TinyUSB layer, below app code.
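The difference between an unbounded send loop and a bounded one can be shown with a small simulation. This is a sketch only, not the actual USBCDC or HWCDC implementation: `StalledEndpoint`, `tryWrite`, and `boundedWrite` are hypothetical names, and the endpoint models a host that has stopped reading, so the TX FIFO never drains.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical TX endpoint whose FIFO never drains, modelling a host that
// floods the modem but never reads replies.
struct StalledEndpoint {
  size_t free_bytes;  // space left in the FIFO; never replenished
  size_t tryWrite(const uint8_t* data, size_t len) {
    (void)data;  // contents are irrelevant to the simulation
    const size_t n = len < free_bytes ? len : free_bytes;
    free_bytes -= n;
    return n;  // 0 once the FIFO is full
  }
};

// Bounded write: gives up after max_tries consecutive attempts that make no
// progress. A loop of the form `while (sent < len)` with no such budget would
// spin forever against this endpoint, which is the wedge described above.
size_t boundedWrite(StalledEndpoint& ep, const uint8_t* data, size_t len,
                    unsigned max_tries) {
  size_t sent = 0;
  unsigned tries = max_tries;
  while (sent < len && tries > 0) {
    const size_t n = ep.tryWrite(data + sent, len - sent);
    if (n == 0) {
      --tries;  // no progress: burn one attempt
      continue;
    }
    sent += n;
    tries = max_tries;  // progress made: reset the budget
  }
  return sent;  // may be short; the caller decides whether to drop the frame
}
```

With a retry budget, a stalled host costs at most a fixed number of iterations per write, so a single-threaded `loop()` keeps running and can still service later commands.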
Fix
Switch the ESP32 KISS envs to the ESP32-S3 USB-Serial-JTAG peripheral (HWCDC, ARDUINO_USB_MODE=1), whose write() is bounded (tries/timeout countdown, not an infinite spin) and which posts RX from ISR.
variants/{heltec_v4,station_g2}/platformio.ini: build_unflags the board default ARDUINO_USB_MODE=0, then set ARDUINO_USB_MODE=1 (a bare -U is reordered after the -D by SCons).
examples/kiss_modem/: transport-neutral hardening, namely non-blocking writes that drop a frame instead of stalling loop() when the TX buffer is full, plus setTxTimeoutMs(). A dropped reply is harmless; the host retries.
Validation
heltec_v4_kiss_modem, Station_G2_kiss_modem, RAK_4631_kiss_modem.
nm: HWCDC linked, zero TinyUSB symbols.
A Wrinkle
ARDUINO_USB_MODE=1 changes the USB descriptor, so the /dev/serial/by-id/ path becomes usb-Espressif_USB_JTAG_serial_debug_unit_<MAC>-if00.